The first date in the MA genomic data is 2020-01-29 and the last date is 2023-01-31. There are 693 unique SARS-CoV-2 lineages during this period. If we look at Nextstrain clades instead, there are 31 clades.
Each sequenced sample has two dates: the date the sample was taken and the date it was submitted. The reporting lag for each sample is the time elapsed between those two dates. Our training period starts at the beginning of March 2021, a point at which the median lags seem reasonable.
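As a minimal sketch of the lag calculation described above, the snippet below computes the reporting lag in days and the median lag per collection month. The function names and the three toy records are hypothetical and only illustrate the arithmetic; they are not taken from the MA data.

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Hypothetical toy records: (date sample was taken, date sample was submitted).
samples = [
    (date(2021, 3, 1), date(2021, 3, 15)),
    (date(2021, 3, 10), date(2021, 3, 20)),
    (date(2021, 4, 2), date(2021, 4, 30)),
]

def reporting_lag_days(collected, submitted):
    """Days elapsed between collection and submission."""
    return (submitted - collected).days

def median_lag_by_month(records):
    """Median reporting lag, keyed by (year, month) of the collection date."""
    lags = defaultdict(list)
    for collected, submitted in records:
        key = (collected.year, collected.month)
        lags[key].append(reporting_lag_days(collected, submitted))
    return {month: median(values) for month, values in sorted(lags.items())}

monthly_median_lag = median_lag_by_month(samples)
```

Grouping by collection month rather than submission month is an assumption made here; it matches the idea of choosing a training start date based on when samples were taken.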
There are 693 unique lineages in the data. Below are monthly boxplots of the number of unique lineages on a given date. A large number of lineages were circulating in the latter half of 2021.
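The quantity behind the boxplots can be sketched as follows: count the distinct lineages seen on each collection date, then collect those daily counts by month so that each month has one distribution. The record layout, function names, and lineage labels are placeholders, not the actual data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy records: (collection date, lineage label).
records = [
    (date(2021, 3, 1), "lineage_a"),
    (date(2021, 3, 1), "lineage_b"),
    (date(2021, 3, 1), "lineage_a"),
    (date(2021, 3, 2), "lineage_a"),
    (date(2021, 4, 1), "lineage_c"),
]

def unique_lineages_per_day(rows):
    """Number of distinct lineages observed on each collection date."""
    seen = defaultdict(set)
    for collected, lineage in rows:
        seen[collected].add(lineage)
    return {day: len(lineages) for day, lineages in sorted(seen.items())}

def daily_counts_by_month(per_day):
    """Group the daily counts by (year, month); one list per boxplot."""
    by_month = defaultdict(list)
    for day, n_lineages in per_day.items():
        by_month[(day.year, day.month)].append(n_lineages)
    return dict(by_month)

per_day = unique_lineages_per_day(records)
by_month = daily_counts_by_month(per_day)
```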
For readability, here we filtered the data to include only the 5 lineages with the highest proportions among the samples taken on a given day. We also truncated the dates to the period from March 2021 to Jan 2023. I decided to take out the legend: with 315 unique lineages, neither the legend nor the plot is very readable.
Because the set of lineages sampled varies a lot from day to day, we need another way to make the plot more readable. We will include only the 5 lineages with the highest proportions among the samples taken in a given epiweek (instead of on a given day), which excludes lineages that are not consistently detected. With this, there are 88 unique lineages below.
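The top-5 filter can be sketched with one function that takes the grouping period as a parameter, so the same code covers the per-day and per-epiweek versions. Everything here is illustrative: the function names and toy records are hypothetical, and the epiweek is assumed to follow the CDC convention of starting on Sunday, which the original analysis may or may not use.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta

def epiweek_start(d):
    """Sunday starting the week that contains d (assumed Sunday-start epiweek)."""
    return d - timedelta(days=(d.weekday() + 1) % 7)

def top_lineages_by_period(rows, period_of, n=5):
    """For each period, the n lineages with the highest sample proportions."""
    counts = defaultdict(Counter)
    for collected, lineage in rows:
        counts[period_of(collected)][lineage] += 1
    top = {}
    for period, counter in counts.items():
        total = sum(counter.values())
        # Sort by descending count, breaking ties by lineage name.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
        top[period] = [(lineage, count / total) for lineage, count in ranked]
    return top

# Hypothetical toy records: (collection date, lineage label).
records = [
    (date(2021, 3, 1), "lineage_a"),
    (date(2021, 3, 2), "lineage_a"),
    (date(2021, 3, 3), "lineage_b"),
    (date(2021, 3, 8), "lineage_a"),
]

weekly_top = top_lineages_by_period(records, epiweek_start, n=1)
# Union of lineages that make the cut in at least one period.
kept = {lineage for tops in weekly_top.values() for lineage, _ in tops}
```

Passing `lambda d: d` as `period_of` gives the per-day variant. The size of `kept` is the kind of number quoted in the text (315 for days, 88 for epiweeks), although the toy data here obviously do not reproduce those figures.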
We divided the total number of samples taken on a given day by the population of MA expressed in units of 100K, a constant of about 68 (roughly 6.8 million people), to get a daily sampling rate per 100K. Again we focus on the period from March 2021 onward. The rates were fairly low from the later part of 2022 onward. The sampling rates will be used as a feature and could probably use some smoothing (the smooth line in the figure is generated from a cubic spline).
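A small sketch of the rate calculation, using the constant of 68 (population in units of 100K) from the text. For smoothing, the sketch substitutes a centered moving average as a simple stand-in; it is not the cubic spline used for the figure. The function names, window size, and daily counts are hypothetical.

```python
POPULATION_IN_100K = 68.0  # MA population in units of 100K, as stated in the text

def sampling_rate_per_100k(n_samples):
    """Samples taken on a day per 100K population."""
    return n_samples / POPULATION_IN_100K

def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the edges.

    A stand-in smoother for illustration, not a cubic spline.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

# Hypothetical daily sample counts for four consecutive days.
daily_counts = [68, 136, 204, 272]
rates = [sampling_rate_per_100k(n) for n in daily_counts]
smoothed = moving_average(rates, window=3)
```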
By now I realized that the daily sampling rates fluctuate so much that the lineages detected vary greatly from day to day. I decided to look at the daily growth rates of the 5 lineages with the highest proportions in a given week (there are 88 unique lineages during this period), so that we have relatively stable sets of lineages. Well, it's still not readable.
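The text does not spell out how the daily growth rate is defined, so the sketch below assumes one plausible definition: the relative change in a lineage's daily sample proportion from one calendar day to the next. Both the definition and the function name are assumptions for illustration, and the toy proportions are made up.

```python
from datetime import date, timedelta

def daily_growth_rates(proportions):
    """Relative day-over-day change in a lineage's sample proportion.

    Assumed definition: (p_t - p_{t-1}) / p_{t-1}. Days without an
    observation on the previous calendar day, or with a zero previous
    proportion, are skipped.
    """
    days = sorted(proportions)
    growth = {}
    for prev, curr in zip(days, days[1:]):
        if curr - prev != timedelta(days=1):
            continue  # gap in the series: no previous-day value to compare to
        if proportions[prev] == 0:
            continue  # relative change is undefined
        growth[curr] = (proportions[curr] - proportions[prev]) / proportions[prev]
    return growth

# Hypothetical daily proportions for a single lineage (note the gap on Mar 4).
toy = {
    date(2021, 3, 1): 0.25,
    date(2021, 3, 2): 0.5,
    date(2021, 3, 3): 0.25,
    date(2021, 3, 5): 0.5,
}
growth = daily_growth_rates(toy)
```

With a definition like this, small denominators on low-sampling days produce large swings, which is consistent with the noisy day-to-day picture described above.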
I did the same with the 5 lineages with the highest average daily growth rates in a given week. There are 209 unique lineages, too many to be readable on the plot below.
To do: figure out the minimum number of clades during these dates.